-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-12936][SQL] Initial bloom filter implementation #10883
Conversation
58e7c59
to
cbc53f2
Compare
cc @rxin @liancheng |
cbc53f2
to
bbf3822
Compare
Test build #49939 has finished for PR 10883 at commit
|
bbf3822
to
2db0171
Compare
Test build #49940 has finished for PR 10883 at commit
|
2db0171
to
1617226
Compare
Test build #49941 has finished for PR 10883 at commit
|
import java.io.UnsupportedEncodingException; | ||
import java.util.Arrays; | ||
|
||
public class DefaultBloomFilter extends BloomFilter { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should be consistent in naming with the count-min sketch one, i.e. rename this BloomFilterImpl.
69ac65d
to
920f292
Compare
return bits.bitSize(); | ||
} | ||
|
||
private static long hashObjectToLong(Object item) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The string part is same with count-min, but long part is different, is it worth to abstract it? cc @rxin
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the long part different?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because they use different strategy to produce n hash values for a long.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's add some inline comment explaining why they are different
Test build #50002 has finished for PR 10883 at commit
|
while (i < numInsertion) { | ||
filter.put(allItems(i)) | ||
i += 1 | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's OK to use allItems.take(numInsertion).foreach(filter.put)
for simplicity if this code path doesn't slow down test execution too much.
Test build #50005 has finished for PR 10883 at commit
|
Test build #50012 has finished for PR 10883 at commit
|
// After merge, `filter1` has `numItems` items which doesn't exceed `expectedNumItems`, so the | ||
// `expectedFpp` should not be significantly higher than the default one: 3% | ||
// Skip byte type as it has too little distinct values. | ||
assert(typeName == "Byte" || 0.03 - filter1.expectedFpp() < 0.001) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the 0.001 here stand for epsilon? Should we wrap 0.03 - filter1.expectedFpp()
with Math.abs()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we still need to special case "Byte" here when we already have numItems
(it's 200 for Byte type)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be nice to make 0.01 a private static constant, e.g. EPSILON
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually we should swap 0.03
and filter1.expectedFpp()
here.
One high level comment about testing code: I'm a little bit confused by those magic numbers there. Would be nice to name them. |
* Creates a {@link BloomFilter} with given {@code expectedNumItems} and a default 3% {@code fpp}. | ||
*/ | ||
public static BloomFilter create(long expectedNumItems) { | ||
return create(expectedNumItems, 0.03); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should have a public static constant DEFAULT_EXPECTED_FPP
for this 0.03. I was pretty confused when finding the magic number 0.03 in the testing code.
Tip: I renamed |
testItemType[Long]("Long", 100000) { _.nextLong() } | ||
|
||
testItemType[String]("String", 100000) { r => r.nextString(r.nextInt(512)) } | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be nice to add another test case for incompatible merge (like this one).
Test build #50031 has finished for PR 10883 at commit
|
Test build #50034 has finished for PR 10883 at commit
|
LGTM |
As discussed offline, it might be better to have put(String) and put(Long) in addition to put(Object). I'm going to merge this pull request. Please address the remaining comments in a new pull request. |
This PR adds an initial implementation of bloom filter in the newly added sketch module. The implementation is based on the
BloomFilter
class in guava.Some difference from the design doc:
bitSize
instead ofsizeInBytes
to user.expectedInsertions
parameter when create bloom filter.